Journal of Clinical Epidemiology — Latest Matching Preprints

1

An Empirical Assessment of Inferential Reproducibility of Linear Regression in Health and Biomedical Research Papers

Jones, L.; Barnett, A.; Hartel, G.; Vagenas, D.

2026-04-07 health systems and quality improvement 10.64898/2026.04.07.26350296 medRxiv

Top 0.1%

33.4%

Background: In health research, variability in modelling decisions can lead to different conclusions even when the same data are analysed, a challenge known as inferential reproducibility. In linear regression analyses, incorrect handling of key assumptions, such as normality of the residuals and linearity, can undermine reproducibility. This study examines how violations of these assumptions influence inferential conclusions when the same data are reanalysed. Methods: We randomly sampled 95 health-related PLOS ONE papers from 2019 that reported linear regression in their methods. Data were available for 43 papers, and 20 were assessed for computational reproducibility, with three models per paper evaluated. The 14 papers that included a model at least partially computationally reproduced were then examined for inferential reproducibility. To assess the impact of assumption violations, differences in coefficients, 95% confidence intervals, and model fit were compared. Results: Of the fourteen papers assessed, only three were inferentially reproducible. The most frequently violated assumptions were normality and independence, each occurring in eight papers. Violations of independence were particularly consequential and were commonly associated with inferential failure. Although reproduced analyses often retained the same binary statistical significance classification as the original studies, confidence intervals were frequently wider, indicating greater uncertainty and reduced precision. Such uncertainty may affect the interpretation of results and, in turn, influence treatment decisions and clinical practice. Conclusion: Our findings demonstrate that substantial violations of key modelling assumptions often went undetected by authors and peer reviewers and, in many cases, were associated with inferential reproducibility failure. This highlights the need for stronger statistical education and greater transparency in modelling decisions. 
Rather than applying rigid or misinformed rules, such as incorrectly testing the normality of the outcome variable, researchers should adopt modelling frameworks guided by the research question and the study design. When assumptions are violated, appropriate alternatives, such as robust methods, bootstrapping, generalized linear models, or mixed-effects models, should be considered. Given that assumption violations were common even in relatively simple regression models, early and sustained collaboration with statisticians is critical for supporting robust, defensible, and clinically meaningful conclusions.
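One of the alternatives the abstract names, bootstrapping, can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' analysis: the data, the function names, and the choice of a percentile bootstrap over resampled (x, y) pairs are all assumptions made for the example.

```python
import random

def ols_slope(xs, ys):
    """Ordinary least-squares slope of ys regressed on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

def bootstrap_slope_ci(xs, ys, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for the slope, resampling (x, y) pairs."""
    rng = random.Random(seed)
    n = len(xs)
    slopes = []
    while len(slopes) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        bx = [xs[i] for i in idx]
        by = [ys[i] for i in idx]
        if len(set(bx)) < 2:  # degenerate resample: slope is undefined
            continue
        slopes.append(ols_slope(bx, by))
    slopes.sort()
    lo = slopes[int((alpha / 2) * n_boot)]
    hi = slopes[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Synthetic data (hypothetical): a linear trend with skewed, non-normal errors.
rng = random.Random(0)
xs = [float(i) for i in range(30)]
ys = [2.0 * x + rng.expovariate(0.5) for x in xs]

slope = ols_slope(xs, ys)
lo, hi = bootstrap_slope_ci(xs, ys)
print(f"slope={slope:.3f}, 95% bootstrap CI=({lo:.3f}, {hi:.3f})")
```

Resampling pairs avoids relying on normally distributed residuals for the interval, but it does not address non-independence; clustered data would need a cluster-level resampling scheme or a mixed-effects model.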

2

Challenges in the Computational Reproducibility of Linear Regression Analyses: An Empirical Study

Jones, L. V.; Barnett, A.; Hartel, G.; Vagenas, D.

2026-04-07 health systems and quality improvement 10.64898/2026.04.07.26350286 medRxiv

Top 0.1%

28.4%

Background: Reproducibility concerns in health research have grown, as many published results fail to be independently reproduced. Achieving computational reproducibility, where others can replicate the same results using the same methods, requires transparent reporting of statistical tests, models, and software use. While data-sharing initiatives have improved accessibility, the actual usability of shared data for reproducing research findings remains underexplored. Addressing this gap is crucial for advancing open science and ensuring that shared data meaningfully support reproducibility and enable collaboration, thereby strengthening evidence-based policy and practice. Methods: A random sample of 95 PLOS ONE health research papers from 2019 reporting linear regression was assessed for data-sharing practices and computational reproducibility. Data were accessible for 43 papers. From the randomly selected sample, the first 20 papers with available data were assessed for computational reproducibility. Three regression models per paper were reanalysed. Results: Of the 95 papers, 68 reported having data available, but 25 of these lacked the data required to reproduce the linear regression models. Only eight of 20 papers we analysed were computationally reproducible. A major barrier to reproducing the analyses was the great difficulty in matching the variables described in the paper to those in the data. Papers sometimes failed to be reproduced because the methods were not adequately described, including variable adjustments and data exclusions. Conclusion: More than half (60%) of analysed studies were not computationally reproducible, raising concerns about the credibility of the reported results and highlighting the need for greater transparency and rigour in research reporting. When data are made available, authors should provide a corresponding data dictionary with variable labels that match those used in the paper. 
Analysis code, model specifications, and any supporting materials detailing the steps required to reproduce the results should be deposited in a publicly accessible repository or included as supplementary files. To increase the reproducibility of statistical results, we propose a Model Location and Specification Table (MLast), which tracks where and what analyses were performed. In conjunction with a data dictionary, MLast enables the mapping of analyses, greatly aiding computational reproducibility.

3

Cochrane Evaluation of (Semi-) Automated Review (CESAR) Methods: Protocol for an adaptive platform study within reviews

Gartlehner, G.; Banda, S.; Callaghan, M.; Chase, J.-A.; Dobrescu, A.; Eisele-Metzger, A.; Flemyng, E.; Gardner, S.; Griebler, U.; Helfer, B.; Jemiolo, P.; Macura, B.; Minx, J. C.; Noel-Storr, A.; Rajabzadeh Tahmasebi, N.; Sharifan, A.; Meerpohl, J.; Thomas, J.

2026-04-15 health informatics 10.64898/2026.04.13.26350802 medRxiv

Top 0.1%

6.3%

Background: Artificial intelligence (AI) has the potential to improve the efficiency of evidence synthesis and reduce human error. However, robust methods for evaluating rapidly evolving AI tools within the practical workflows of evidence synthesis remain underdeveloped. This protocol describes a study design for assessing the effectiveness, efficiency, and usability of AI tools in comparison to traditional human-only workflows in the context of Cochrane systematic reviews. Methods: Members of the Cochrane Evaluation of (Semi-) Automated Review (CESAR) Methods Project developed an adaptive platform study-within-a-review (SWAR) design, modeled after clinical platform trials. This design employs a master protocol to concurrently evaluate multiple AI tools (interventions) against a standard human-only process (control) across three key review tasks: title and abstract screening, full-text screening, and data extraction. The adaptive framework allows for the addition or removal of AI tools based on interim performance analyses without necessitating a restart of the study. Performance will be assessed using metrics such as accuracy (sensitivity, specificity, precision), efficiency (time on task), response stability, impact of errors, and usability, in alignment with Responsible use of AI in evidence SynthEsis (RAISE) principles. Results: The study will generate comparative data about the performance and usability of specific AI tools employed in a semi- or fully automated manner relative to standard human effort. The protocol provides a flexible framework for the assessment of AI tools in evidence synthesis, addressing the limitations of static, one-time evaluations. Discussion: This study protocol presents a novel methodological approach to addressing the challenges of evaluating AI tools for evidence syntheses. 
By validating entire workflows rather than individual technologies, the findings will establish an evidence base for determining the viability of integrating AI into evidence-synthesis workflows. The adaptive design of this study is flexible and can be adopted by other investigators, ensuring that the evaluation framework remains relevant as new tools emerge.

4

JARVIS, should this study be selected for full-text screening? Performance of a Joint AI-ReViewer Interactive Screening tool for systematic reviews

Barreto, G. H. C.; Burke, C.; Davies, P.; Halicka, M.; Paterson, C.; Swinton, P.; Saunders, B.; Higgins, J. P. T.

2026-04-11 health informatics 10.64898/2026.04.08.26350384 medRxiv

Top 0.1%

5.0%

Background: Systematic reviews are essential for evidence-based decision making in health sciences but require substantial time and resource for manual processes, particularly title and abstract screening. Recent advances in machine learning and large language models (LLMs) have demonstrated promise in accelerating screening with high recall but are often limited by modest gains in efficiency, mostly due to the absence of a generalisable stopping criterion. Here, we introduce and report preliminary findings on the performance of a novel semi-automated active learning system, JARVIS, that integrates LLM-based reasoning using the PICOS framework, neural networks-based classification, and human decision-making to facilitate abstract screening. Methods: Datasets containing author-made inclusion and exclusion decisions from six published systematic reviews were used to pilot the semi-automated screening system. Model performance was evaluated across recall, specificity and area under the precision-recall curve (AUC-PR), using full-text inclusion as the ground truth. Estimated workload and financial savings were calculated by comparing total screening time and reviewer costs across manual and semi-automated scenarios. Results: Across the six review datasets, recall ranged between 98.2% and 100%, and specificity ranged between 97.9% and 99.2% at the defined stopping point. Across iterations, AUC-PR values ranged between 83.8% and 100%. Compared with human-only screening, JARVIS delivered workload savings between 71.0% and 93.6%. When a single reviewer read the excluded records, workload savings ranged between 35.6% and 46.8%. Conclusion: The proposed semi-automated system substantially reduced reviewer workload while maintaining high recall, improving on previously reported approaches. Further validation in larger and more varied reviews, as well as prospective testing, is warranted.

5

Evaluating Large Language Models for Transparent Quality-of-Care Measurement in Children with ADHD

Bannett, Y.; Pillai, M.; Huang, T.; Luo, I.; Gunturkun, F.; Hernandez-Boussard, T.

2026-04-17 pediatrics 10.64898/2026.04.12.26350732 medRxiv

Top 0.1%

4.9%

Importance: Guideline-concordant care for young children with attention-deficit/hyperactivity disorder (ADHD) includes recommending parent training in behavior management (PTBM) as first-line treatment. However, assessing guideline adherence through manual chart review is time-consuming and costly, limiting scalable and timely quality-of-care measurement. Objective: To evaluate the accuracy and explainability of large language models (LLMs) in identifying PTBM recommendations in pediatric electronic health record (EHR) notes as a scalable alternative to manual chart review. Design, Setting, and Participants: This retrospective cohort study was conducted in a community-based pediatric healthcare network in California consisting of 27 primary care clinics. The study cohort included children aged 4-6 years with ≥2 primary care visits between 2020-2024 and ICD-10 diagnoses of ADHD or ADHD symptoms (n=542 patients). Clinical notes from the first ADHD-related visit were included. A stratified subset of 122 notes, including all cases with model disagreement, was manually annotated to assess model performance in identifying PTBM recommendations and rank model explanations. Exposures: Assessment and plan sections of clinical notes were analyzed using three generative large language models (Claude-3.5, GPT-4o, and LLaMA-3.3-70B) to identify the presence of PTBM recommendations and generate explanatory rationales and documentation evidence. Main Outcomes and Measures: Model performance in identifying PTBM recommendations (measured by sensitivity, positive predictive value (PPV), and F1-score) and qualitative explainability ratings of model-generated rationales (based on the QUEST framework). Results: All three models demonstrated high performance compared to expert chart review. Claude-3.5 showed balanced performance (sensitivity=0.89, PPV=0.95, and F1-score=0.92) and ranked highest in explainability. 
LLaMA-3.3-70B achieved sensitivity=0.91, PPV=0.89, and F1-score=0.90, ranking second for explainability. GPT-4o had the highest PPV (0.97) but lowest sensitivity (0.82), with an F1-score of 0.89 and the lowest explainability ranking. Based on classifications from the best-performing model, Claude-3.5, 26.4% (143/542) of patients had documented PTBM recommendations at their first ADHD-related visit. Conclusions and Relevance: LLMs can accurately extract guideline-concordant clinician recommendations for non-pharmacological ADHD treatment from unstructured clinical notes while providing clear explanations and supporting evidence. Evaluating model explainability as part of LLM implementation for medical chart review tasks can promote transparent and scalable solutions for quality-of-care measurement.
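The F1-scores reported above follow from the stated sensitivity and PPV, since F1 is their harmonic mean. The short check below uses only the figures quoted in the abstract; the function and variable names are illustrative.

```python
def f1_score(sensitivity, ppv):
    """Harmonic mean of sensitivity (recall) and PPV (precision)."""
    return 2 * sensitivity * ppv / (sensitivity + ppv)

# (sensitivity, PPV, reported F1) as quoted in the abstract.
reported = {
    "Claude-3.5": (0.89, 0.95, 0.92),
    "LLaMA-3.3-70B": (0.91, 0.89, 0.90),
    "GPT-4o": (0.82, 0.97, 0.89),
}

for model, (sens, ppv, f1_reported) in reported.items():
    f1 = f1_score(sens, ppv)
    print(f"{model}: computed F1={f1:.2f}, reported F1={f1_reported:.2f}")

# Share of patients with a documented PTBM recommendation (143 of 542).
ptbm_share = 100 * 143 / 542
print(f"PTBM documented: {ptbm_share:.1f}%")
```

Because the published inputs are themselves rounded to two decimals, agreement at two decimals is the most this check can show.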

6

Data sharing policies, requirements, and support from public and private clinical trial sponsors: a survey on top sponsors of clinical trials in Europe

Tai, K. H.; Varvara, G.; Escoffier, E.; Mansmann, U.; DeVito, N. J.; Vieira Armond, A. C.; Naudet, F.

2026-04-01 health informatics 10.64898/2026.03.31.26349853 medRxiv

Top 0.1%

4.8%

Objective: To map the presence, public availability, and content of clinical trial data sharing policies (DSP), data management and sharing plans (DMSP), and data use agreements (DUA) among the most prolific public and private clinical trial sponsors operating in the European Union, and to identify key areas of convergence, divergence, and constraint in the context of General Data Protection Regulation (GDPR). Eligibility criteria: We included organisation-level documents describing approaches to clinical trial data sharing or data management from the top 20 public and top 20 private sponsors ranked by the number of trials registered in the EU Clinical Trials Information System (CTIS). Eligible materials comprised publicly available or sponsor-shared policies, guidelines, statements, templates, and agreements relevant to clinical trial data sharing or management. Sources of evidence: Evidence was identified through systematic searches of sponsors' public websites, structured Google searches, and major data management plan platforms (DMPTool, DMPonline, DMP Assistant), complemented by direct contact with sponsors to verify findings and request missing documentation. All sources were archived and catalogued. Charting methods: Two reviewers independently extracted data using a structured form, capturing the existence, accessibility, and content of data sharing policies, data management and sharing plans, and data use agreements. Quantitative data were summarised descriptively, and a non-interpretive descriptive content analysis was conducted to characterise recurring policy elements and areas of heterogeneity. Results: Among 40 sponsors, private sponsors were substantially more likely than public sponsors to make trial-specific data sharing policies and data use agreements publicly accessible, often via established data sharing platforms. 
Public sponsors more frequently referenced data management and sharing plans, but these were heterogeneous in scope and often embedded within broader institutional governance documents rather than tailored to clinical trials. Across sectors, GDPR compliance, data protection, and legal safeguards were emphasised, while operational aspects such as dataset readiness, review criteria, and downstream responsibilities varied widely. Overall response rate to sponsor verification was 37.5%. Conclusion: Clinical trial data sharing governance in the EU shows a marked sectoral imbalance among the top sponsors. Private sponsors tend to provide more detailed and operationally explicit documentation, whereas public sponsors often articulate high-level commitments without trial-specific guidance. Greater clarity and standardisation, particularly among public sponsors, could improve transparency and facilitate responsible data reuse, while remaining compatible with GDPR requirements.

7

Therapist effects in real-world rehabilitation outcomes: a cohort study of the nationwide GLA:D osteoarthritis management program in Denmark

Obasohan, P. E.; Palmer, J.; Alderson, D.; Yu, D.; Gronne, D. T.; Roos, E. M.; Skou, S. T.; Peat, G. M.

2026-04-21 rehabilitation medicine and physical therapy 10.64898/2026.04.20.26351120 medRxiv

Top 0.1%

4.4%

Objective: Unlike several other fields of healthcare, little is known about the size of therapist effects on patient outcomes following rehabilitation for musculoskeletal conditions. We aimed to estimate the proportion of variance in patient outcomes from a structured rehabilitation program explained by therapist effects. Methods: For our observational cohort study we accessed data from the national multicentre Good Life with osteoArthritis in Denmark (GLA:D) osteoarthritis management program. Analyses included 23,021 consecutive eligible adults with hip or knee osteoarthritis (mean (SD) age 65.0 (9.8) years, 71% female) treated by 657 therapists between October 2014 and February 2019. The primary outcome was ≥30% reduction in pain intensity on 0-100 VAS at 3 months. Therapist effects were estimated as the variance partition coefficient (intra-class correlation coefficient (ICC)) from two-level random intercept logistic regression models before and after adjusting for patient-level case-mix factors and therapist-level characteristics (number of patients treated, days since therapist certification). Analyses were repeated for a range of secondary outcomes using multiply imputed data and complete-case analysis. Results: 52% of patients reported a ≥30% reduction in pain intensity on 0-100 VAS at 3 months. In the null model the ICC was 0.007 (95%CI: 0.005, 0.009), which changed little after adjusting for patient- and therapist-level covariates. Upper confidence limits for ICC estimates across all secondary outcomes in multiply imputed and complete case analyses were less than 0.03. Conclusions: In a nationally implemented osteoarthritis management program delivered by trained healthcare professionals, therapist effects made a minimal contribution to variation in patient outcomes. 
Key messages. What is already known on this topic: Therapist effects - defined as the effect of a given therapist on patient outcomes as compared to another therapist - have been observed in several fields of healthcare and have important consequences for selection, training, and service improvement. In musculoskeletal rehabilitation five previous studies suggest that 1-12% of variation in patient-reported outcomes may be attributable to therapist effects, but these estimates were based on relatively small datasets resulting in substantial uncertainty. What this study adds: Our cohort study analysed registry data from 2014-2019 on 23,021 patients and 647 trained therapists from the nationally implemented GLA:D structured osteoarthritis management program in Denmark. We found that therapist effects accounted for less than 3% of total variation in patient-reported pain and quality of life outcomes 3 months after beginning the program. How this study might affect research, practice, or policy: Our findings suggest that contextual factors that relate to therapist effects - therapist characteristics or therapist-patient interaction and alliance - make a minimal contribution to variation in patient outcomes from this structured, group-based rehabilitation intervention. Any contextual effects must be attributable to alternative sources, e.g. patient expectations, intervention setting.
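To make the ICC from a random intercept logistic model concrete, the sketch below uses the common latent-variable definition, in which the patient-level residual variance is fixed at π²/3 (the variance of the standard logistic distribution). The abstract does not state which ICC definition the authors used, so this formula is an assumption, and the back-calculated therapist-level variance is illustrative only.

```python
import math

# Variance of the standard logistic distribution (latent-variable approach).
LOGISTIC_RESIDUAL_VAR = math.pi ** 2 / 3

def icc_from_variance(var_therapist):
    """ICC = between-therapist variance / (between-therapist + residual variance)."""
    return var_therapist / (var_therapist + LOGISTIC_RESIDUAL_VAR)

def variance_from_icc(icc):
    """Invert the ICC formula to get the implied between-therapist variance."""
    return icc * LOGISTIC_RESIDUAL_VAR / (1 - icc)

# Implied random-intercept variance for the reported null-model ICC of 0.007.
var_u = variance_from_icc(0.007)
print(f"implied therapist-level variance: {var_u:.4f}")
print(f"round trip ICC: {icc_from_variance(var_u):.3f}")
```

Under this definition an ICC of 0.007 corresponds to a random-intercept variance of roughly 0.023 on the log-odds scale, which conveys how little outcomes differ between therapists.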

8

Mapping Evidence Gap Between NMN and NR for Metabolic Outcomes: A Systematic Review, Transitivity Assessment, and Indirect Comparison Meta-Analysis

Nguyen, A. T.; Nguyen, B.

2026-04-09 biochemistry 10.64898/2026.04.07.716917 medRxiv

Top 0.1%

4.1%

Background: Nicotinamide mononucleotide (NMN) and nicotinamide riboside (NR) are NAD+ precursor supplements widely marketed for metabolic health benefits. Despite billions of dollars in annual sales, no head-to-head randomized controlled trial (RCT) has compared their effects on metabolic endpoints, and no systematic characterization of why reliable comparison is currently impossible has been published. Objective: To characterize the structural heterogeneity of the NMN and NR trial evidence bases across population, dose, duration, and biomarker dimensions; to formally assess transitivity; and to estimate indirect NMN versus NR effects where methodologically feasible using the Bucher indirect comparison method. Methods: Five databases (PubMed, Embase, Scopus, Web of Science, Cochrane CENTRAL) were searched from January 2018 to May 2025. Eligible studies were RCTs of oral NMN or NR versus placebo in adults reporting metabolic outcomes. A formal transitivity assessment was conducted comparing effect modifier distributions across NMN and NR trial arms. Random-effects pairwise meta-analyses were conducted for each precursor versus placebo, and Bucher indirect comparisons estimated NMN versus NR effects through the common placebo node. Risk of bias was assessed using RoB 2 and certainty of evidence using the GRADE/CINeMA framework. Results: Fifteen studies (5 NMN, 10 NR; 740 participants) were included. The NMN and NR trial evidence bases were systematically asymmetric across every major effect modifier: NR was dosed 1.9 to 9.2 times higher than NMN on a molar basis; NMN trials were conducted predominantly in East Asian populations while NR trials were predominantly Western; and available NAD+ pharmacodynamic measures used incompatible assay matrices precluding indirect comparison. 
Across 14 metabolically comparable outcomes, no indirect comparison reached statistical significance and all were rated Very Low certainty by GRADE/CINeMA, consistent with the structural limitations of the evidence base. Leave-one-out sensitivity analyses showed zero pairwise significance changes and one indirect significance change (triglycerides upon exclusion of Conze 2019). Conclusion: Current evidence is structurally insufficient to support reliable indirect comparison of NMN and NR for metabolic outcomes. The barriers are quantifiable and modifiable: future head-to-head trials should use equimolar dosing (approximately 1,150 mg NMN is molar-equivalent to 1,000 mg NR), harmonized whole-blood NAD+ assays reported in mol/L, minimum 24 weeks duration, and enrollment of metabolically at-risk populations to generate interpretable comparative evidence. Registration: PROSPERO 2026 CRD420261330487; registered prior to data screening.
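The Bucher method named in the abstract estimates an indirect A-versus-B effect by subtracting the two placebo-anchored pooled effects and adding their variances. The sketch below shows that calculation; the input numbers are hypothetical placeholders, not results from the review, and the function name is illustrative.

```python
import math

def bucher_indirect(d_a, se_a, d_b, se_b, z=1.96):
    """Indirect A-vs-B effect via a common comparator (Bucher method).

    d_a, d_b: pooled effects of A and B versus the common comparator.
    se_a, se_b: their standard errors (assumed independent).
    """
    diff = d_a - d_b
    se = math.sqrt(se_a ** 2 + se_b ** 2)
    return diff, se, (diff - z * se, diff + z * se)

# Hypothetical pooled mean differences versus placebo (illustration only).
diff, se, (lo, hi) = bucher_indirect(d_a=-0.20, se_a=0.10, d_b=-0.05, se_b=0.12)
print(f"indirect effect={diff:.3f}, SE={se:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

Because the variances add, the indirect estimate is always less precise than either direct comparison, and the estimate is only interpretable when the two trial sets are similar in their effect modifiers, which is the transitivity assumption the review assesses.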

9

Assessing Compliance with Reporting Requirements in European Phase II to IV Clinical Trials: A Cross-Sectional Observational Study

Bruckner, T.; Dike, C. E.; Caquelin, L.; Freeman, A.; Aspromonti, D. A.; DeVito, N.; Song, Z.; Karam, G.; Nilsonne, G.

2026-04-05 health policy 10.64898/2026.04.03.26350111 medRxiv

Top 0.1%

3.7%

Objectives: To assess the availability of key clinical trial registration data and compliance with legal reporting requirements for all Phase 2-4 drug trials registered on the new European Clinical Trial Information System (CTIS) registry. This study is the first ever assessment of data quality and legal compliance with reporting requirements on CTIS. Design: Cross-sectional observational study of CTIS registry data combined with manual review of results documents. Setting: Cohort of all 7,547 Phase II-IV clinical trials registered on CTIS as of November 2025. Main outcome measures: Number and proportion of missing data points in CTIS registration data. Proportion of completed clinical trials that are compliant with regulatory reporting requirements. Results: Trial registration data quality was high overall with more than 99% of expected data present. Of 234 clinical trials legally required to report results, fewer than half (49.6%) fully reported results within the required timeframe, 20 trials (8.5%) fully reported results late, and 98 trials (41.9%) failed to fully report results. Legal compliance was similar for adult trials (79/158) and paediatric trials (37/76). Conclusions: Sponsor compliance with legal reporting requirements is weak. Current efforts by European regulators to monitor and enforce compliance appear to be insufficient. New results reporting functions currently being set up by trial registries worldwide will require quality assurance processes. Trial registration: Study protocol prospectively registered on OSF: https://osf.io/sn4j2/overview
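The reporting figures in this abstract are internally consistent, as the short check below shows. It assumes that the adult and paediatric numerators (79 and 37) count trials that fully reported on time, which the abstract implies but does not state outright.

```python
def pct(k, n):
    """Percentage of k out of n, rounded to one decimal place."""
    return round(100 * k / n, 1)

total = 234          # trials legally required to report results
late = 20            # fully reported, but late
not_reported = 98    # failed to fully report
on_time = total - late - not_reported

print(on_time, pct(on_time, total))        # on-time count and percentage
print(pct(late, total), pct(not_reported, total))

# Subgroup split as given in the abstract (assumed to be on-time reporting).
adult_ok, adult_n = 79, 158
paed_ok, paed_n = 37, 76
print(pct(adult_ok, adult_n), pct(paed_ok, paed_n))
```

The implied on-time count is 116 trials, and the adult and paediatric subgroups sum to both that count and the overall denominator.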

10

Covariate adjustment for hierarchical outcomes and the win ratio: how to do it and is it worthwhile?

Hazewinkel, A.-D.; Gregson, J.; Bartlett, J. W.; Gasparyan, S. B.; Wright, D.; Pocock, S.

2026-03-31 cardiovascular medicine 10.64898/2026.03.30.26347966 medRxiv

Top 0.2%

2.0%

Objectives: To introduce a new covariate adjustment method for hierarchical outcomes using ordinal logistic regression, compare it with existing approaches, and assess whether adjustment improves power in randomized trials with hierarchical outcomes. Methods: We developed an ordinal regression-based method for covariate adjustment of the win ratio and compared it with three alternatives: probability index models, inverse probability weighting, and a randomization-based estimator. Methods were applied to the EMPEROR-Preserved trial and tested through extensive simulations involving two common hierarchical outcome structures: time-to-event composites, and composites combining time-to-event with quantitative measures. Simulations assessed impacts on estimates, standard errors, and power across prognostic and non-prognostic settings. Results: In RCT data and simulations, covariate adjustment consistently increased power when adjusting for prognostic baseline variables. Gains were comparable to or greater than those in conventional Cox models, with no power loss for non-prognostic covariates. Our ordinal approach performed similarly to existing methods while providing interpretable covariate effect estimates. Adjusting for baseline values of quantitative components yielded power gains according to the baseline-to-follow-up correlation. Conclusions: Covariate adjustment for prognostic variables meaningfully improves efficiency in win ratio analyses for hierarchical outcomes. Our ordinal method is easily implemented and facilitates covariate effect interpretation. We recommend the broader adoption of covariate adjustment and our ordinal method in randomized trials using hierarchical outcomes.
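For readers unfamiliar with the win ratio, the sketch below computes the basic unadjusted version on a toy two-tier hierarchical outcome. It is not the ordinal regression method proposed in the paper: it ignores censoring and covariates, and the data and function names are hypothetical.

```python
def compare(a, b):
    """Compare two patients tier by tier; higher values are better.

    Returns 1 if a wins, -1 if b wins, 0 if tied on every tier.
    """
    for x, y in zip(a, b):
        if x > y:
            return 1
        if x < y:
            return -1
    return 0

def win_ratio(treated, control):
    """Unadjusted win ratio over all treated-control pairs."""
    wins = losses = ties = 0
    for t in treated:
        for c in control:
            result = compare(t, c)
            if result > 0:
                wins += 1
            elif result < 0:
                losses += 1
            else:
                ties += 1
    ratio = wins / losses if losses else float("inf")
    return ratio, wins, losses, ties

# Toy data: (days alive during follow-up, change in a quality-of-life score).
# The first tier takes priority; the second breaks ties on the first.
treated = [(365, 10), (365, 4), (200, 0)]
control = [(365, 6), (150, 0), (300, 2)]

ratio, wins, losses, ties = win_ratio(treated, control)
print(f"wins={wins}, losses={losses}, ties={ties}, win ratio={ratio:.2f}")
```

Each treated patient is compared with each control patient on the highest-priority outcome first, moving to the next tier only on a tie; the win ratio is total wins divided by total losses.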

11

Randomized controlled trials do not support efficacy of any of the tested doses of fluvoxamine in prevention of disease progression in adults with incipient non-severe COVID-19 disease: a case-study systematic review and meta-analysis

Trkulja, V.

2026-04-03 pharmacology and therapeutics 10.64898/2026.04.01.26349972 medRxiv

Top 0.2%

1.9%

Background. Recent meta-analyses of randomized controlled trials (RCTs) claimed efficacy of higher-dose fluvoxamine (2 x 100 mg/day, as opposed to 2 x 50 mg/day) in prevention of disease deterioration in adults with mild - moderate COVID-19 disease. Objectives. Investigate whether such claims are supported by the data. Methods. Systematic review and meta-analysis of RCTs evaluating higher-dose fluvoxamine in this indication. Results. Seven studies declared as RCTs were identified, one of which was severely biased (open-label, non-standardized and unreported standard of care as a control), and eventually ended as non-randomized (huge attrition). Composite endpoints of deterioration in the 6 included placebo-controlled trials contained elements susceptible to error and bias. Three trials were small (<100 patients/arm), three were larger (270 - 750 patients/arm). Deaths and need for mechanical ventilation were sporadic and observed in only one trial. Hospitalizations were also sporadic in 5/6 trials. Frequentist methods generally appropriate for random-effects analysis of low number of trials with rare outcomes (generalized linear mixed models, beta-binomial or binomial-normal) greatly underestimated heterogeneity, but still did not document benefits regarding the composite endpoints or hospitalizations. Bayesian hierarchical models revealed huge heterogeneity and indicated no benefit regarding: (i) composites of deterioration, large trials OR = 0.78 (95% CrI 0.55 - 1.21); multiplicity corrected OR = 0.87 (0.64 - 1.21); (ii) hospitalizations, small trials OR = 0.88 (0.45 - 1.72); large trials OR = 0.94 (0.52 - 1.75); all trials OR = 0.81 (0.47 - 1.43). Heterogeneity was unlikely due to clinical particulars (vaccination status, treatment duration, time horizon), and more likely due to unidentified bias. Conclusions. RCTs do not support efficacy of higher-dose fluvoxamine in prevention of disease deterioration in adults with mild - moderate COVID-19 disease.

12

High-Throughput Observational Evidence Generation Using Linked Electronic Health Record and Claims Data

Gombar, S.; Shah, N.; Sanghavi, N.; Coyle, J.; Mukerji, A.; Chappelka, M.

2026-04-07 health informatics 10.64898/2026.04.07.26350300 medRxiv

Top 0.2%

1.9%

Background: The observational literature on comparative effectiveness is expanding rapidly but remains difficult to synthesize. Discordant findings often stem from structural differences in cohort definitions, inclusion criteria, and follow-up windows, leaving stakeholders without a cohesive evidence base. Furthermore, studies typically focus on a narrow subset of outcomes, neglecting the broader needs of diverse healthcare stakeholders [1-4]. Methods: We developed a high-throughput evidence generation workflow using linked EHR and administrative claims data. The cornerstone is a prespecified measurement architecture applied uniformly across clinical scenarios: six post-index windows (acute to two-year follow-up); 28 Elixhauser comorbidities; 14 healthcare resource utilization (HCRU) categories; 29 laboratory measures with 52 binary thresholds; and 42 adverse event categories. We generated unadjusted treatment comparisons across ~1,038 outcomes per scenario, including effect-measure modification (EMM) assessments across 130 baseline features. Results: Across 40 clinical domains, the workflow produced approximately 32,982,552 outcome evaluations. An evaluation included a treatment comparison outcome population effect estimate with uncertainty bounds and supporting diagnostics. Approximately 5,000 narrative summaries underwent structured clinical and statistical quality control before dissemination. Conclusions: Standardized, high-throughput workflows can shift evidence generation away from fragmented studies toward comprehensive evidence packages. This shared evidence base supports precision medicine by making treatment effect heterogeneity visible across clinically meaningful subpopulations, reducing the need for redundant, stakeholder-specific studies.

13

Protocol for LLM-Generated CONSORT Report for Increased Reporting: A Parallel-Arm Randomized Controlled Trial (Protocol)

Krauska, A. N.; Rohe, K.

2026-04-17 health policy 10.64898/2026.04.15.26350926 medRxiv

Top 0.2%

1.9%


Background: Randomized controlled trials (RCTs) often have incomplete methods reporting despite widespread adoption of the CONSORT guideline. The editorial process is supposed to detect these shortcomings and request clarifications from authors, which is time-consuming. We developed an LLM-based CONSORT Rohe Nordberg Report that highlights which CONSORT items appear fully or partially reported, checks page references claimed by authors, and then creates follow-up questions so that authors can more easily correct missing information. Methods: This parallel-arm, superiority RCT will randomize eligible RCT submissions (after desk screening) 1:1 into intervention (editorial team and authors receive the Rohe Nordberg Report) or control (standard editorial review only). The primary outcome is whether manuscripts improve their reporting of CONSORT items in the Methods and Results sections between the original submission and first revision. This will be assessed by blinded human reviewers who evaluate the textual changes for improvements between the original and revised manuscripts for each relevant CONSORT item. Secondary outcomes include time to editorial decisions, rejection and non-resubmission rates, whether authors can correctly identify where CONSORT items are reported, and extent of revisions. Human evaluators will be blinded to whether the manuscript was in the intervention or control group. Discussion: By providing authors and the editorial team with specific follow-up questions for each underreported CONSORT item, we hypothesize that basic underreporting will be more efficiently detected and corrected. Using blinded human reviewers as the primary outcome assessors ensures a rigorous, unbiased evaluation. If successful, this approach may help align manuscripts more closely with CONSORT standards, ultimately benefiting evidence synthesis.

14

Diagnostic Accuracy of Large Language Models for Rare Diseases: A Systematic Review and Meta-Analysis

Nguyen, M.-H.; Yang, C.-T.; Cassini, T. A.; Ma, F.; Hamid, R.; Bastarache, L.; Peterson, J. F.; Xu, H.; Li, L.; Ma, S.; Shyr, C.

2026-03-27 genetic and genomic medicine 10.64898/2026.03.26.26349194 medRxiv

Top 0.2%

1.8%


Background: Large language models (LLMs) have been evaluated as tools to assist rare disease diagnosis, yet evidence on their accuracy remains fragmented. We conducted a systematic review and meta-analysis to synthesize the available evidence on the diagnostic performance of LLMs, identify sources of heterogeneity, and evaluate the current evidence base for clinical translation. Methods: We searched PubMed, Embase, Web of Science, Cochrane Library, arXiv, and medRxiv (January 2020-February 2026). Full-text articles and preprints were considered for inclusion. Eligible studies applied LLM-based systems to generate differential diagnoses for rare diseases and provided Recall@1 (R@1; proportion with the correct diagnosis ranked first). We pooled R@1 using Freeman-Tukey double arcsine transformation with DerSimonian-Laird random-effects models. Pre-specified subgroup analyses examined LLM knowledge augmentation strategy and input modality. Because both retained high residual heterogeneity, we conducted a post-hoc exploratory analysis of evaluation benchmark disease composition, mapping diseases from major benchmarks to Orphanet prevalence classifications. Risk of bias was assessed using a modified QUADAS-3 instrument. Findings: We identified 902 records, of which 564 were screened and 15 studies were eligible. These 15 studies contributed 19 system-dataset entries to the meta-analysis (total N=39,529 cases). The pooled R@1 was 43.3% (95% CI 35.1-51.6; I2=99.6%). Augmented LLM systems (agent-based reasoning, retrieval, or fine-tuning; k=8) achieved R@1 of 52.5% (42.0-62.9) versus 35.4% (30.6-40.4) for standalone LLMs (k=11; p=0.004). 
Post-hoc exploratory analysis indicated that evaluation benchmark disease composition was associated with differences in diagnostic performance: R@1 was 21.7% (18.2-25.5) on the Phenopacket Store dataset, which contained a higher proportion of ultra-rare diseases (52.8%; k=2), versus 52.0% (40.7-63.2) on RareBench (29.3%; k=6; p<0.001). All 19 system-dataset entries were assessed to be at high risk of bias, most commonly due to potential data leakage and limited reproducibility. No study provided prospective clinical validation. Interpretation: Diagnostic performance of LLM-based systems for rare diseases varied substantially across evaluation benchmarks. Post-hoc exploratory analysis indicated that performance was associated with benchmark disease composition. Performance was higher in benchmarks containing fewer ultra-rare diseases and in systems incorporating external knowledge at inference time. However, all included studies were at high risk of bias, and none reported prospective clinical validation. These findings highlight the need for prevalence-stratified evaluation benchmarks and independent prospective studies before clinical deployment. Funding: This work was supported in part by the National Institutes of Health Common Fund, grant 15-HG-0130 from the National Human Genome Research Institute, U01NS134349 from the National Institute of Neurological Disorders and Stroke, R00LM014429 from the National Library of Medicine, and the Potocsnak Center for Undiagnosed and Rare Disorders.
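
The pooling approach named in the Methods (Freeman-Tukey double arcsine transformation with a DerSimonian-Laird random-effects model) can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' code: the function names and any counts passed in are hypothetical, and the back-transformation uses the simple sin² inverse rather than a harmonic-mean corrected inversion.

```python
import math

def freeman_tukey(successes, n):
    """Double arcsine transform of successes/n and its approximate variance."""
    t = (math.asin(math.sqrt(successes / (n + 1)))
         + math.asin(math.sqrt((successes + 1) / (n + 1))))
    return t, 1.0 / (n + 0.5)

def dersimonian_laird(studies):
    """Pool proportions from (successes, n) pairs.

    Returns (pooled proportion, tau^2, I^2 as a fraction).
    """
    transformed = [freeman_tukey(x, n) for x, n in studies]
    ts = [t for t, _ in transformed]
    vs = [v for _, v in transformed]

    # Fixed-effect (inverse-variance) step, needed for Cochran's Q.
    w = [1.0 / v for v in vs]
    sw = sum(w)
    fixed = sum(wi * ti for wi, ti in zip(w, ts)) / sw
    q = sum(wi * (ti - fixed) ** 2 for wi, ti in zip(w, ts))
    df = len(studies) - 1

    # DerSimonian-Laird moment estimator of between-study variance.
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0

    # Random-effects weights and pooled estimate on the transformed scale.
    w_star = [1.0 / (v + tau2) for v in vs]
    pooled_t = sum(wi * ti for wi, ti in zip(w_star, ts)) / sum(w_star)

    i2 = max(0.0, (q - df) / q) if q > 0 else 0.0
    # Simplified back-transform (assumption): p = sin^2(t / 2).
    return math.sin(pooled_t / 2) ** 2, tau2, i2
```

Calling `dersimonian_laird([(x1, n1), (x2, n2), ...])` with one pair per system-dataset entry yields the pooled R@1 together with the heterogeneity statistics; confidence intervals are omitted here for brevity.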

15

Implementation of Human-in-the-Loop ChatGPT-based Patient Screening Across Multiple Diverse Clinical Trials

Dohopolski, M.; Esselink, K.; Desai, N.; Grones, B.; Patel, T.; Jiang, S.; Peterson, E.; Navar, A. M.

2026-03-27 health informatics 10.64898/2026.03.20.26348890 medRxiv

Top 0.3%

1.7%


Purpose: Manual screening for trial eligibility is inefficient and costly. We prospectively evaluated a large language model (LLM)-assisted prescreening workflow across multiple active trials. Methods: We deployed a retrieval-augmented generation LLM-based pipeline across multiple trials at an academic medical center. Structured electronic health record data and free-text notes were used by the LLM to classify each criterion as either met, likely met, likely not met, not met, uncertain, or no documentation found, with accompanying rationale. Coordinators were provided a sorted patient list based on LLM-derived eligibility and reviewed each case, documenting their assessment of individual criteria and final prescreening status (success vs failure). Criterion-level performance, measured by accuracy, sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and F1 score, was calculated and tracked over time. Patient prescreening status was also evaluated as a function of the percentage of individual AI criteria met (60-80% and ≥80%). Results: From October 2024 to September 2025, 39,182 patients were prescreened using the LLM workflow across 26 studies (21 oncology and 5 non-oncology), encompassing 112 distinct criteria. A total of 914 patients with high likelihood of eligibility underwent coordinator review (5,096 criteria evaluated). Aggregated criterion-level performance was as follows: accuracy 0.94 (95% CI, 0.92-0.96), sensitivity 0.98 (0.97-0.99), specificity 0.81 (0.71-0.88), PPV 0.95 (0.92-0.97), NPV 0.93 (0.90-0.95), and F1 score 0.97 (0.95-0.97). Twenty-seven criteria prompts across 14/26 trials were automatically updated based on coordinator feedback. Patients with ≥80% of AI-labeled criteria classified as met or likely met were more likely to be reviewed by coordinators (544/987, 55.1% vs 372/397, 93.7%) and more likely to be labeled as prescreening successes (104/544, 19.1% vs 162/372, 43.5%) compared to those with 60-80%. 
The average cost was $0.12 per patient. Conclusion: An LLM-assisted, human-in-the-loop prescreening workflow demonstrated high criterion-level performance at low cost across a diverse set of actively enrolling clinical trials. Structured coordinator feedback enabled an automated learning system, improving screening efficiency while preserving necessary human oversight.
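
The criterion-level metrics reported above all derive from a 2x2 table of LLM labels against coordinator assessments. A minimal sketch follows, assuming the LLM output has already been collapsed to a binary met / not met label; the function name and the counts used with it are hypothetical, since the abstract does not report the underlying cell counts.

```python
def criterion_metrics(tp, fp, tn, fn):
    """Compute standard diagnostic metrics from confusion-matrix counts.

    tp: LLM says met, coordinator agrees      fp: LLM says met, coordinator disagrees
    tn: LLM says not met, coordinator agrees  fn: LLM says not met, coordinator disagrees
    Assumes every denominator below is non-zero.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F1 is the harmonic mean of PPV (precision) and sensitivity (recall).
    f1 = 2 * tp / (2 * tp + fp + fn)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
    }
```

Because F1 depends only on PPV and sensitivity, it can stay high even when specificity is noticeably lower, which is the pattern in the figures reported here.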

16

Multi-Task Learning and Soft-Label Supervision for Psychosocial Burden Profiling in Cancer Peer-Support Text

Wang, Z.; Cao, Y.; Shen, X.; Ding, Z.; Liu, Y.; Zhang, Y.

2026-04-04 health informatics 10.64898/2026.04.03.26350034 medRxiv

Top 0.4%

1.3%


Objective: Online cancer peer-support text contains signals of psychosocial burden beyond emotional tone, including treatment burden, financial strain, uncertainty, and unmet support needs. We evaluated 2 modeling extensions: multi-task learning (MTL) for joint prediction of health economics and outcomes research (HEOR) burden dimensions, and soft-label supervision using large language model (LLM)-derived probability distributions. Materials and Methods: We analyzed 10,392 cancer peer-support posts. GPT-4o-mini generated proxy annotations for HEOR burden subscales, composite burden, high-need status, speaker role, cancer type, and emotion probabilities. Study 1 trained a shared ALBERT encoder under 4 MTL conditions: composite and subscale burden targets, each with and without auxiliary heads, using Kendall uncertainty weighting. Study 2 compared soft-label training on LLM emotion distributions with hard-label baselines under regular and token-augmented inputs, evaluating performance against both human labels and AI distributions. Results: Composite-only MTL achieved R2=0.446 for burden regression and weighted F1=0.810 for high-need screening; subscale classification achieved mean weighted F1=0.646. Adding auxiliary role and cancer-type heads reduced regression performance (ΔR2 = -0.209). Soft-label training reduced weighted F1 by 0.16 versus hard-label baselines (0.68 vs. 0.86), and token augmentation did not improve performance under soft supervision. Discussion: Composite-only MTL supported modeling of multidimensional burden-related signals from forum text, whereas auxiliary prediction heads appeared to compete with primary tasks. Soft-label training aligned poorly with human-labeled emotion categories, suggesting that uncalibrated LLM distributions may propagate bias rather than improve supervision. Conclusion: Composite-only MTL was the strongest burden-modeling approach, and hard-label supervision remained preferable for emotion classification.
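
The abstract does not spell out the loss, but "Kendall uncertainty weighting" is assumed here to mean homoscedastic-uncertainty weighting of task losses, in which each task has a learnable log-variance. The sketch below shows only how the combined loss is formed, in a commonly used simplified form that drops constant factors; the function name is hypothetical, and in real training the log-variances would be parameters updated by the optimizer alongside the encoder.

```python
import math

def uncertainty_weighted_loss(task_losses, log_variances):
    """Combine per-task losses as sum_i exp(-s_i) * L_i + s_i.

    s_i = log(sigma_i^2) is the (normally learnable) log-variance of task i.
    A large s_i down-weights a noisy task; the additive s_i term penalizes
    inflating the variance to ignore a task entirely.
    """
    if len(task_losses) != len(log_variances):
        raise ValueError("one log-variance is required per task loss")
    return sum(math.exp(-s) * loss + s
               for loss, s in zip(task_losses, log_variances))
```

With all log-variances at zero this reduces to the plain sum of task losses, which is a convenient starting point before any weights are learned.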

17

The Clinician Model Card: development and evaluation of clinician-centered documentation for AI-based clinical decision support

Agha-Mir-Salim, L.; Frey, N.; Kaiser, Z.; Mosch, L.; Weicken, E.; Freyer, O.; Ma, J.; Mittermaier, M.; Meyer, A.; Gilbert, S.; Muller-Birn, C.; Balzer, F.

2026-04-17 health informatics 10.64898/2026.04.15.26350930 medRxiv

Top 0.4%

1.3%


AI documentation frameworks remain poorly designed for point-of-care use, leaving clinicians without actionable information on how to use clinical AI models when they need it most. We developed the Clinician Model Card, an interactive, clinician-centered documentation tool, and evaluated it in a sequential exploratory mixed-methods study: interviews with 12 physicians informed iterative co-design, and the resulting tool was evaluated in a national survey of 129 physicians across Germany. The tool was well received: 84% agreed it should be routinely available, and 66% considered its content relevant to clinical decision-making. Yet comprehensibility of statistical performance metrics remained poor despite targeted interventions: only 32% understood the Validation & Performance section well, and fewer than 54% correctly interpreted AUROC or PPV, with AI literacy as a strong predictor of comprehension. We propose empirically derived design principles for clinician-centered AI documentation. Effective AI transparency requires not only clinician-friendly design and workflow integration, but also sustained investment in AI literacy.

18

Planned egg freezing over 15 years: return to treatment and success rates in Australia and New Zealand

Fitzgerald, O.; Keller, E.; Illingworth, P.; Lieberman, D.; Peate, M.; Kotevski, D.; Paul, R.; Rodino, I.; Parle, A.; Hammarberg, K.; Copp, T.; Chambers, G. M.

2026-04-11 epidemiology 10.64898/2026.04.07.26350362 medRxiv

Top 0.5%

0.9%


Study question: What are the characteristics and treatment outcomes of women who undertook planned egg freezing (PEF) in Australia and New Zealand between 2009 and 2023? Summary answer: There has been an average yearly increase in the uptake of PEF of 35%, with most women undergoing a single PEF procedure in their mid-thirties. Given ten years of follow-up, a little over one in four women return, with nearly half of those using donor sperm and one-third achieving a live birth. What is known already: PEF, where women freeze their eggs as a strategy to preserve fertility, has increased dramatically in high-income countries in the last decade. Despite the rapid uptake of PEF, there remains limited information to guide women, clinicians and policy makers regarding the characteristics of women undertaking this procedure and treatment outcomes. Study design, size, duration: A retrospective population-based cohort study of all women who undertook PEF in Australia and New Zealand between 2009 and 2023, including their subsequent return to thaw their eggs and treatment outcomes. Where women returned to utilise their eggs, all subsequent embryo transfer procedures were linked, enabling calculation of live birth rates per woman. Participants/materials, setting, methods: 20,209 women who undertook PEF in Australia and New Zealand between 2009 and 2023, including 1,657 women who returned to thaw their eggs. Main results and the role of chance: There has been a huge increase in uptake of PEF, from 55 women in 2009 to 4,919 in 2023. Women who freeze their eggs are typically aged 34-38 years (interquartile range) and nulliparous (98.6%). For women with at least 10 years of follow-up (i.e. undertook PEF in 2009-2013; N=514), 27.9% returned and thawed their frozen eggs (average time to return: 4.9 years). This reduced to 22.1% in those with at least 5 years of follow-up (i.e. undertook PEF in 2009-2018; N=4,288). Of those who used their frozen eggs, 47% used donor sperm. 
After at least two years of follow-up, 33.9% had a live birth, rising over time to 37.8% for eggs thawed between 2019 and 2021. Limitations, reasons for caution: In the timeframe 2009-2019 we did not have information on whether egg freezing occurred because of a cancer diagnosis, a cohort we wished to exclude from the study. As a result, for this timeframe we weighted observations by the probability that egg freezing occurred due to cancer, with the prediction model developed on the years 2020-2023. Wider implications of the findings: This study provides recent and comprehensive data on PEF to guide prospective patients and clinicians and inform policy. The exponential growth in PEF in Australia and New Zealand mirrors trends in other high-income countries, suggesting a doubling time of 2-3 years. Study findings highlight the need for setting realistic expectations about the likelihood of returning to use frozen eggs and live birth rates. Study funding/competing interest(s): 2020-2025 MRFF Emerging Priorities and Consumer Driven Research initiative: EPCD000014
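
The link between the reported 35% average yearly increase and the stated doubling time of 2-3 years follows from the standard compound-growth relationship. The sketch below is an arithmetic check only, assuming constant year-on-year growth; the function names are hypothetical and this is not necessarily how the authors derived their figure.

```python
import math

def doubling_time(annual_growth_rate):
    """Years needed to double under constant compound growth (0.35 means 35% per year)."""
    return math.log(2) / math.log(1 + annual_growth_rate)

def implied_annual_growth(start_count, end_count, years):
    """Constant annual growth rate that takes start_count to end_count over `years` years."""
    return (end_count / start_count) ** (1 / years) - 1

# Figures taken from the abstract: 35% yearly increase; 55 women in 2009, 4,919 in 2023.
from_reported_rate = doubling_time(0.35)
from_counts = doubling_time(implied_annual_growth(55, 4919, 2023 - 2009))
```

Both routes give a doubling time between two and three years, consistent with the range quoted in the abstract.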

19

Improving walking after lumbar spinal stenosis surgery: co-design and single-arm feasibility trial of the STructured Rehabilitation and InDividualised Exercise and Education (STRIDE) programme

McIlroy, S.; Bearne, L.; McCarter, A.; McPherson, C.; Chaplin, H.; Brighton, L. J.; Weinman, J.; Norton, S.

2026-03-31 rehabilitation medicine and physical therapy 10.64898/2026.03.28.26349602 medRxiv

Top 0.5%

0.9%


Background: Lumbar spinal stenosis (LSS) can cause pain and severe walking limitation. Although surgery aims to improve walking, many patients do not achieve clinically meaningful gains. Rehabilitation can improve outcomes, yet existing programmes lack robust evidence and theoretical underpinning. This study aimed to (1) co-design a theory-informed rehabilitation programme to improve walking after LSS surgery, and (2) evaluate feasibility of conducting a future trial and acceptability of the intervention. Methods: A multi-methods study included intervention co-design followed by a single-arm feasibility study. Co-design used an adapted Experience-Based Co-Design approach with patients, carers, and healthcare professionals (n=39), integrating the Behaviour Change Wheel. This resulted in STructured Rehabilitation and InDividualised Exercise and Education (STRIDE), delivered over 12 weeks pre-surgery and 12 weeks post-surgery, targeting knowledge, expectations, perceived control, physical capability, and fears. Adults aged ≥50 years awaiting LSS surgery were recruited to a before-after feasibility study. Feasibility outcomes included recruitment and retention. Acceptability was assessed using the Theoretical Framework of Acceptability questionnaire (scored 0-5, with 5 indicating high acceptability) and focus groups. Clinical outcomes measured at baseline, post-prehabilitation, and post-rehabilitation included 6-minute walk distance (6MWD) and mean daily step count over 7 days. Results: Fifteen of 31 eligible participants were recruited (48%; mean age 70 years), with 80% retained to study end (2 decided against surgery, 1 unable to complete final assessment). Acceptability was high (median 5/5, IQR 0). Participants valued the personalised, supportive approach and reported improved motivation and preparation for surgery, though travel was burdensome. 
Small pre-operative and moderate-to-large post-operative improvements were observed in 6MWD (+49.9 m and +81.6 m) and daily step count (+868 and +1405 steps/day). Conclusions: This co-designed, physiotherapy-led, behaviour-change rehabilitation programme was acceptable to participants, with encouraging recruitment, retention, and signals of improved walking following LSS surgery. The findings support progression to a future trial.

20

Assessing the efficacy of behaviourally informed invitation messaging in increasing attendance at the NHS Targeted Lung Health Check: A randomised experimental study

Tan, X.; Danka, M. N.; Urbanski, S.; Kitsawat, P.; McElvaney, T. J.; Jundi, S.; Porter, L.; Gericke, C.

2026-04-24 public and global health 10.64898/2026.04.12.26350693 medRxiv

Top 0.5%

0.9%


Background: Lung cancer screening can reduce lung cancer mortality through early detection, but uptake of the NHS Targeted Lung Health Check (TLHC) programme remains low. Behaviourally informed invitation messages have been proposed as a low-cost approach to increase attendance, but evidence of their effectiveness in lung cancer screening is mixed. Few intervention studies have used evidence-based behaviour change frameworks, and invitation strategies have rarely been tailored to empirically identified barriers and enablers. Methods: In an online experiment, 3,274 adults aged 55-74 years with a history of smoking were randomised to see one of four behaviourally informed invitation messages or a control message. Participants then rated their intention to attend a TLHC appointment, and selected barriers and enablers to attending from a pre-defined list, which were classified according to the Theoretical Domains Framework. Invitation messages were mapped to Behaviour Change Techniques using the Theory and Techniques Tool. Message conditions were compared on intention to attend TLHC using bootstrapped ANOVA followed by pairwise comparisons. Exploratory counterfactual mediation analyses examined the role of fear in intention to attend. Results: Behaviourally informed invitation messages did not meaningfully increase intention to attend TLHC compared with the control message. While a GP-endorsed message showed a small potential benefit relative to the other conditions, this finding was not robust after adjustment for multiple comparisons. Participants most frequently reported barriers related to Emotion (particularly fear), Social Influence, and Knowledge, while Beliefs about Consequences emerged as the primary enabler of attendance. Only around half of reported barriers and enablers were addressed by the invitation messages. 
Exploratory analyses found that fear was associated with lower intention to attend a TLHC appointment, yet none of the behaviourally informed messages appeared to reduce fear compared to the control message. Conclusions: Improving lung cancer screening uptake will likely require invitation messages that directly address emotional concerns, particularly fear, alongside credible recommendations. These findings highlight the importance of systematically aligning invitation message content with empirically identified behavioural influences when designing scalable interventions to improve lung cancer screening uptake.